Background:
Banks loan out money to customers to finance a car or a house, pay for education, consolidate debt, and so on. Borrowers agree to repay the money with interest in monthly installments. Sometimes, due to unexpected circumstances, some borrowers are unable to pay back the money.
Research Objectives:
Banks want to find patterns in customer data that predict whether a customer can pay back a loan, so they know whom to lend to. I would like to predict, when a customer applies to a bank, whether the bank should lend money to that customer, based on a model learned from this dataset.
Research Questions:
Which factors contribute to, or correlate most with, bank loan status?
Can we predict if a borrower will be able to pay the debt in full?
Which machine learning algorithms perform best at the prediction? (first guess)
After tuning the hyperparameters of all (or the best) algorithms, what is the best prediction performance, and which algorithm achieves it?
Dataset
The dataset is taken from Kaggle (https://www.kaggle.com/zaurbegiev/my-dataset). There are 100,000 rows and 19 columns (features). The target variable is Loan_Status, a categorical variable whose value is either “Fully Paid” or “Charged Off”. Fully Paid means the borrower paid back the debt in full, while Charged Off means the borrower has been substantially delinquent for a period of time and is unlikely to repay the bank; the remainder of the debt is sometimes collected by a third-party agency. The 19 columns are:
LoanID,
CustomerID,
Loan_Status,
Current_Loan_Amount,
Term,
Credit_Score,
Annual_Income,
Years_in_current_job,
Home_Ownership,
Purpose,
Monthly_Debt,
Years_of_Credit_History,
Months_since_last_delinquent,
Number_of_Open_Accounts,
Number_of_Credit_Problems,
Current_Credit_Balance,
Maximum_Open_Credit,
Bankruptcies,
Tax_Liens
library(caret)
library(dplyr)
library(tidyverse)
library(mlbench)
library(gmodels)
# for multiple plots in one figure (ggarrange)
library(ggpubr)
library(ggplot2)
library(clustertend)
library(factoextra)
library(corrplot)
library(cluster)
library(magrittr)
library(fpc)
library(pheatmap)
library(clValid)
theme_set(theme_classic())
# Clear memory
rm(list=ls())
# read in data
BankLoan_dataset <- read.csv("LoanStatus.csv")
# remove irrelevant identifier fields
BankLoan_dataset <- subset(BankLoan_dataset, select=-c(LoanID, CustomerID))
# data structure
str(BankLoan_dataset)
## 'data.frame': 100000 obs. of 17 variables:
## $ Loan_Status : chr "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
## $ Current_Loan_Amount : int 445412 262328 99999999 347666 176220 206602 217646 648714 548746 215952 ...
## $ Term : chr "Short Term" "Short Term" "Short Term" "Long Term" ...
## $ Credit_Score : int 709 NA 741 721 NA 7290 730 NA 678 739 ...
## $ Annual_Income : int 1167493 NA 2231892 806949 NA 896857 1184194 NA 2559110 1454735 ...
## $ Years_in_current_job : chr "8 years" "10+ years" "8 years" "3 years" ...
## $ Home_Ownership : chr "Home Mortgage" "Home Mortgage" "Own Home" "Own Home" ...
## $ Purpose : chr "Home Improvements" "Debt Consolidation" "Debt Consolidation" "Debt Consolidation" ...
## $ Monthly_Debt : num 5215 33296 29201 8742 20640 ...
## $ Years_of_Credit_History : num 17.2 21.1 14.9 12 6.1 17.3 19.6 8.2 22.6 13.9 ...
## $ Months_since_last_delinquent: int NA 8 29 NA NA NA 10 8 33 NA ...
## $ Number_of_Open_Accounts : int 6 35 18 9 15 6 13 15 4 20 ...
## $ Number_of_Credit_Problems : int 1 0 1 0 0 0 1 0 0 0 ...
## $ Current_Credit_Balance : int 228190 229976 297996 256329 253460 215308 122170 193306 437171 669560 ...
## $ Maximum_Open_Credit : int 416746 850784 750090 386958 427174 272448 272052 864204 555038 1021460 ...
## $ Bankruptcies : int 1 0 0 0 0 0 1 0 0 0 ...
## $ Tax_Liens : int 0 0 0 0 0 0 0 0 0 0 ...
# data summary
knitr::kable(head(BankLoan_dataset))
| Loan_Status | Current_Loan_Amount | Term | Credit_Score | Annual_Income | Years_in_current_job | Home_Ownership | Purpose | Monthly_Debt | Years_of_Credit_History | Months_since_last_delinquent | Number_of_Open_Accounts | Number_of_Credit_Problems | Current_Credit_Balance | Maximum_Open_Credit | Bankruptcies | Tax_Liens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fully Paid | 445412 | Short Term | 709 | 1167493 | 8 years | Home Mortgage | Home Improvements | 5214.74 | 17.2 | NA | 6 | 1 | 228190 | 416746 | 1 | 0 |
| Fully Paid | 262328 | Short Term | NA | NA | 10+ years | Home Mortgage | Debt Consolidation | 33295.98 | 21.1 | 8 | 35 | 0 | 229976 | 850784 | 0 | 0 |
| Fully Paid | 99999999 | Short Term | 741 | 2231892 | 8 years | Own Home | Debt Consolidation | 29200.53 | 14.9 | 29 | 18 | 1 | 297996 | 750090 | 0 | 0 |
| Fully Paid | 347666 | Long Term | 721 | 806949 | 3 years | Own Home | Debt Consolidation | 8741.90 | 12.0 | NA | 9 | 0 | 256329 | 386958 | 0 | 0 |
| Fully Paid | 176220 | Short Term | NA | NA | 5 years | Rent | Debt Consolidation | 20639.70 | 6.1 | NA | 15 | 0 | 253460 | 427174 | 0 | 0 |
| Charged Off | 206602 | Short Term | 7290 | 896857 | 10+ years | Home Mortgage | Debt Consolidation | 16367.74 | 17.3 | NA | 6 | 0 | 215308 | 272448 | 0 | 0 |
knitr::kable(summary(BankLoan_dataset))
| Loan_Status | Current_Loan_Amount | Term | Credit_Score | Annual_Income | Years_in_current_job | Home_Ownership | Purpose | Monthly_Debt | Years_of_Credit_History | Months_since_last_delinquent | Number_of_Open_Accounts | Number_of_Credit_Problems | Current_Credit_Balance | Maximum_Open_Credit | Bankruptcies | Tax_Liens | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:100000 | Min. : 10802 | Length:100000 | Min. : 585 | Min. : 76627 | Length:100000 | Length:100000 | Length:100000 | Min. : 0 | Min. : 3.6 | Min. : 0.0 | Min. : 0.00 | Min. : 0.0000 | Min. : 0 | Min. :0.000e+00 | Min. :0.0000 | Min. : 0.00000 | |
| Class :character | 1st Qu.: 179652 | Class :character | 1st Qu.: 705 | 1st Qu.: 848844 | Class :character | Class :character | Class :character | 1st Qu.: 10214 | 1st Qu.:13.5 | 1st Qu.: 16.0 | 1st Qu.: 8.00 | 1st Qu.: 0.0000 | 1st Qu.: 112670 | 1st Qu.:2.734e+05 | 1st Qu.:0.0000 | 1st Qu.: 0.00000 | |
| Mode :character | Median : 312246 | Mode :character | Median : 724 | Median : 1174162 | Mode :character | Mode :character | Mode :character | Median : 16220 | Median :16.9 | Median : 32.0 | Median :10.00 | Median : 0.0000 | Median : 209817 | Median :4.679e+05 | Median :0.0000 | Median : 0.00000 | |
| NA | Mean :11760447 | NA | Mean :1076 | Mean : 1378277 | NA | NA | NA | Mean : 18472 | Mean :18.2 | Mean : 34.9 | Mean :11.13 | Mean : 0.1683 | Mean : 294637 | Mean :7.608e+05 | Mean :0.1177 | Mean : 0.02931 | |
| NA | 3rd Qu.: 524942 | NA | 3rd Qu.: 741 | 3rd Qu.: 1650663 | NA | NA | NA | 3rd Qu.: 24012 | 3rd Qu.:21.7 | 3rd Qu.: 51.0 | 3rd Qu.:14.00 | 3rd Qu.: 0.0000 | 3rd Qu.: 367959 | 3rd Qu.:7.830e+05 | 3rd Qu.:0.0000 | 3rd Qu.: 0.00000 | |
| NA | Max. :99999999 | NA | Max. :7510 | Max. :165557393 | NA | NA | NA | Max. :435843 | Max. :70.5 | Max. :176.0 | Max. :76.00 | Max. :15.0000 | Max. :32878968 | Max. :1.540e+09 | Max. :7.0000 | Max. :15.00000 | |
| NA | NA | NA | NA’s :19154 | NA’s :19154 | NA | NA | NA | NA | NA | NA’s :53141 | NA | NA | NA | NA’s :2 | NA’s :204 | NA’s :10 |
# Parse "8 years" / "10+ years" into numeric years
# (tidyr's extract_numeric is deprecated; readr::parse_number is the modern replacement)
BankLoan_dataset$Years_in_current_job <- extract_numeric(BankLoan_dataset$Years_in_current_job)
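As an aside, the same extraction can be done with a base-R regular expression. This is a minimal sketch (the helper name `to_years` is ours, not part of the pipeline), which maps e.g. "8 years" to 8 and "10+ years" to 10:

```r
# Strip everything except digits and decimal points, then coerce to numeric
to_years <- function(x) as.numeric(gsub("[^0-9.]", "", x))

to_years(c("8 years", "10+ years", "< 1 year"))   # 8 10 1
```

Note that "< 1 year" maps to 1 here; depending on the analysis it may deserve a value below 1.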
# # identify outliers
# outliers <- boxplot(BankLoan_dataset$Annual_Income, plot=FALSE)$out
# # remove outliers
# BankLoan_dataset <- BankLoan_dataset[-which(BankLoan_dataset$Annual_Income %in% outliers), ]
#
We cannot remove outliers one feature at a time, because each removal drops whole rows; when the features are combined afterward, the remaining rows would no longer line up. Therefore, we chain all of the filters in a single pipe, as below.
# show outliers before removing them
histogram(BankLoan_dataset$Annual_Income)
bankloan <- BankLoan_dataset %>%
# convert string feature into categorical factors
mutate_if(is.character, as.factor) %>%
# remove the nulls
drop_na() %>%
### Remove outliers
# Annual_Income
filter(between(Annual_Income,
quantile(Annual_Income, 0.25) - 1.5* IQR(Annual_Income),
quantile(Annual_Income, 0.75) + 1.5* IQR(Annual_Income))) %>%
# Current_Loan_Amount
filter(between(Current_Loan_Amount,
quantile(Current_Loan_Amount, 0.25) - 1.5* IQR(Current_Loan_Amount),
quantile(Current_Loan_Amount, 0.75) + 1.5* IQR(Current_Loan_Amount))) %>%
# Credit_Score
filter(between(Credit_Score,
quantile(Credit_Score, 0.25) - 1.5* IQR(Credit_Score),
quantile(Credit_Score, 0.75) + 1.5* IQR(Credit_Score))) %>%
#Number_of_Credit_Problems (there is an outlier of 15)
filter(between(Number_of_Credit_Problems,
quantile(Number_of_Credit_Problems, 0.25) - 1.5* IQR(Number_of_Credit_Problems),
4)) %>%
# Monthly_Debt
filter(between(Monthly_Debt,
quantile(Monthly_Debt, 0.25) - 1.5* IQR(Monthly_Debt),
quantile(Monthly_Debt, 0.75) + 1.5* IQR(Monthly_Debt))) %>%
# Maximum_Open_Credit
filter(between(Maximum_Open_Credit,
quantile(Maximum_Open_Credit, 0.25) - 1.5* IQR(Maximum_Open_Credit),
quantile(Maximum_Open_Credit, 0.75) + 1.5* IQR(Maximum_Open_Credit))) %>%
# final safety net: drop any remaining NAs
drop_na()
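The six near-identical `filter(between(...))` steps above could be collapsed with a small helper; a sketch under our own naming (`within_iqr` is not part of the original pipeline):

```r
# TRUE where x lies within k*IQR of the quartiles (Tukey's fences)
within_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x >= q[1] - k * IQR(x, na.rm = TRUE) & x <= q[2] + k * IQR(x, na.rm = TRUE)
}

# each outlier filter then becomes, e.g.:
# filter(within_iqr(Annual_Income))
```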
str(bankloan)
## 'data.frame': 25185 obs. of 17 variables:
## $ Loan_Status : Factor w/ 2 levels "Charged Off",..: 2 2 2 1 2 2 1 2 2 1 ...
## $ Current_Loan_Amount : int 217646 548746 234124 317108 465410 449108 688468 129712 287980 374176 ...
## $ Term : Factor w/ 2 levels "Long Term","Short Term": 2 2 2 1 1 2 1 2 2 1 ...
## $ Credit_Score : int 730 678 727 687 688 718 682 723 737 652 ...
## $ Annual_Income : int 1184194 2559110 693234 1133274 1722654 1454507 1494616 1465698 1013954 1239199 ...
## $ Years_in_current_job : num 1 2 10 8 3 8 1 10 1 10 ...
## $ Home_Ownership : Factor w/ 4 levels "HaveMortgage",..: 2 4 4 4 4 2 4 3 2 2 ...
## $ Purpose : Factor w/ 16 levels "Business Loan",..: 4 4 4 4 3 4 4 4 4 10 ...
## $ Monthly_Debt : num 10855 18660 14211 9633 15647 ...
## $ Years_of_Credit_History : num 19.6 22.6 24.7 17.4 22.3 28.8 16.6 19.4 18.6 36.6 ...
## $ Months_since_last_delinquent: int 10 33 46 53 30 21 50 6 13 42 ...
## $ Number_of_Open_Accounts : int 13 4 10 4 7 14 8 34 11 10 ...
## $ Number_of_Credit_Problems : int 1 0 1 0 0 0 0 1 0 0 ...
## $ Current_Credit_Balance : int 122170 437171 28291 60287 107559 193990 343995 45106 223117 126350 ...
## $ Maximum_Open_Credit : int 272052 555038 107052 126940 488356 458414 843854 163218 489302 415602 ...
## $ Bankruptcies : int 1 0 1 0 0 0 0 0 0 0 ...
## $ Tax_Liens : int 0 0 0 0 0 0 0 0 0 0 ...
knitr::kable(summary(bankloan))
| Loan_Status | Current_Loan_Amount | Term | Credit_Score | Annual_Income | Years_in_current_job | Home_Ownership | Purpose | Monthly_Debt | Years_of_Credit_History | Months_since_last_delinquent | Number_of_Open_Accounts | Number_of_Credit_Problems | Current_Credit_Balance | Maximum_Open_Credit | Bankruptcies | Tax_Liens | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Charged Off: 4628 | Min. : 21450 | Long Term : 6672 | Min. :647.0 | Min. : 111245 | Min. : 1.000 | HaveMortgage : 59 | Debt Consolidation:19640 | Min. : 0 | Min. : 3.80 | Min. : 0.00 | Min. : 2.00 | Min. :0.0000 | Min. : 0 | Min. : 0 | Min. :0.0000 | Min. :0.00000 | |
| Fully Paid :20557 | 1st Qu.:160336 | Short Term:18513 | 1st Qu.:702.0 | 1st Qu.: 882911 | 1st Qu.: 3.000 | Home Mortgage:12444 | other : 1633 | 1st Qu.:10559 | 1st Qu.:14.20 | 1st Qu.: 17.00 | 1st Qu.: 8.00 | 1st Qu.:0.0000 | 1st Qu.: 93841 | 1st Qu.: 231880 | 1st Qu.:0.0000 | 1st Qu.:0.00000 | |
| NA | Median :259886 | NA | Median :719.0 | Median :1185961 | Median : 6.000 | Own Home : 2175 | Home Improvements : 1555 | Median :15993 | Median :17.40 | Median : 32.00 | Median :10.00 | Median :0.0000 | Median : 170221 | Median : 384648 | Median :0.0000 | Median :0.00000 | |
| NA | Mean :289590 | NA | Mean :715.1 | Mean :1277038 | Mean : 6.201 | Rent :10507 | Other : 812 | Mean :17067 | Mean :18.64 | Mean : 35.36 | Mean :11.04 | Mean :0.1873 | Mean : 210122 | Mean : 448406 | Mean :0.1241 | Mean :0.02974 | |
| NA | 3rd Qu.:392436 | NA | 3rd Qu.:734.0 | 3rd Qu.:1580040 | 3rd Qu.:10.000 | NA | Business Loan : 357 | 3rd Qu.:22590 | 3rd Qu.:21.90 | 3rd Qu.: 52.00 | 3rd Qu.:13.00 | 3rd Qu.:0.0000 | 3rd Qu.: 285703 | 3rd Qu.: 613426 | 3rd Qu.:0.0000 | 3rd Qu.:0.00000 | |
| NA | Max. :789096 | NA | Max. :751.0 | Max. :2960447 | Max. :10.000 | NA | Medical Bills : 312 | Max. :43110 | Max. :70.50 | Max. :176.00 | Max. :48.00 | Max. :4.0000 | Max. :2028877 | Max. :1302114 | Max. :4.0000 | Max. :4.00000 | |
| NA | NA | NA | NA | NA | NA | NA | (Other) : 876 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
# check for remaining missing values (is.null only tests whether the object itself is NULL)
anyNA(bankloan)
## [1] FALSE
# confirm the outliers are gone
histogram(bankloan$Annual_Income)
The histogram shows that the outliers have been removed; annual income is right-skewed. After removing outliers and NA values, 25,185 observations remain to work with.
# BAR graph for categorical variables
ggplot(bankloan, aes(x=Purpose)) +
geom_bar() +
coord_flip()
gg_status <- ggplot(bankloan, aes(x=Loan_Status)) +
geom_bar()
gg_home <- ggplot(bankloan, aes(x=Home_Ownership)) +
geom_bar() +
coord_flip()
gg_problem <- ggplot(bankloan, aes(x=Number_of_Credit_Problems)) +
geom_bar()
gg_job <- ggplot(bankloan, aes(x=Years_in_current_job)) +
geom_bar()
# arrange multiple plots in one figure
figure <- ggarrange(gg_status, gg_home, gg_problem, gg_job,
ncol = 2, nrow = 2,
legend="none")
figure
# Boxplot for quantitative variables
ggplot(bankloan, aes(x=Loan_Status, y=Annual_Income)) +
geom_boxplot()
ggplot(bankloan, aes(x=Loan_Status, y=Current_Loan_Amount)) +
geom_boxplot()
ggplot(bankloan, aes(x=Loan_Status, y=Credit_Score)) +
geom_boxplot()
ggplot(bankloan, aes(x=Loan_Status, y=Monthly_Debt)) +
geom_boxplot()
From the bar graphs, we can see that the data are imbalanced: Fully Paid far outnumbers Charged Off. Among home-ownership categories, Home Mortgage is the most common, followed by Rent. Most borrowers have zero credit problems, most have worked 10+ years in their current job, and the most common loan purpose is Debt Consolidation.
From the boxplots comparing annual income, current loan amount, monthly debt, and credit score, annual income shows the biggest difference between Fully Paid and Charged Off customers.
# proportion of Paid-fully and Charged-off
table(bankloan$Loan_Status)
##
## Charged Off Fully Paid
## 4628 20557
round(prop.table(table(bankloan$Loan_Status)) * 100, 1)
##
## Charged Off Fully Paid
## 18.4 81.6
# Average group by loan status
# Annual_Income is column5, credit score is column4
aggregate(bankloan[, 4:5], list(bankloan$Loan_Status), median)
## Group.1 Credit_Score Annual_Income
## 1 Charged Off 718 1116250
## 2 Fully Paid 719 1212561
The counts show that Fully Paid makes up 81.6% of the data. The median annual income is about 1.21M for Fully Paid customers versus 1.12M for Charged Off customers. Since annual income is right-skewed, the median is used instead of the mean.
set.seed(9650)
# Divide data into train and test sets
indexTrain <- createDataPartition(y=bankloan$Loan_Status, p=0.75, list=FALSE)
training <- bankloan[indexTrain, ]
testing <- bankloan[-indexTrain, ]
# Since about 25,000 rows remain, take a subset of the data for visual clarity
training <- training[1:2000, ]
testing <- testing[1:500, ]
# Downsampling to balance the classes
# (note: Loan_Status is column 1, so `-ncol` drops Tax_Liens rather than the label,
#  and downSample appends its own `Class` column alongside Loan_Status)
train <- downSample(x=training[, -ncol(training)], y=training$Loan_Status)
test <- downSample(x=testing[, -ncol(testing)], y=testing$Loan_Status)
table(train$Loan_Status)
##
## Charged Off Fully Paid
## 375 375
table(test$Loan_Status)
##
## Charged Off Fully Paid
## 93 93
Data are balanced with downsampling.
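caret's `downSample` essentially samples every class down to the size of the smallest one. A base-R sketch of the same idea (the function name is ours; caret additionally handles factors and edge cases):

```r
set.seed(9650)
# Keep a random subset of each class, all equal to the minority-class size
down_sample <- function(df, y) {
  n_min <- min(table(df[[y]]))
  idx <- unlist(lapply(split(seq_len(nrow(df)), df[[y]]),
                       function(i) sample(i, n_min)))
  df[idx, ]
}

# toy check: 10 "A" rows and 3 "B" rows -> 3 of each
toy <- data.frame(cls = rep(c("A", "B"), c(10, 3)))
table(down_sample(toy, "cls")$cls)   # A: 3, B: 3
```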
# Using balanced data. Taking only quantitative variables
trainQ <- train[, c(2,4,5,6,9,10,11,12,14,15)]
testQ <- test[, c(2,4,5,6,9,10,11,12,14,15)]
# Transform the data and apply PCA, treating Annual_Income as the target
# (note: `Comp` is not a preProcess argument and was silently ignored;
#  the default thresh = 0.95 is what retained the 8 components)
preProc <- preProcess(trainQ[,-3], method=c('BoxCox', 'center', 'scale', 'pca'))
# Applying PCA, predict train data without predicted feature
train_pca <- predict(preProc, trainQ[, -3])
# Adding column that needs to be predicted
train_pca$Annual_Income <- trainQ$Annual_Income
# Examine the transformed training data
head(train_pca)
## PC1 PC2 PC3 PC4 PC5 PC6
## 1 -1.2384075 0.35990852 0.65806763 0.2956382 -0.2399866 -2.1875899
## 2 0.6974771 -1.36486957 -2.09931766 1.4259874 0.2323596 -1.1441729
## 3 0.6328001 0.02077489 -0.04028476 1.1319218 -1.2986672 0.1996309
## 4 1.8143158 -0.26663688 -0.69619182 -2.4765150 -1.3897335 -0.4726096
## 5 -0.7621477 2.33690630 0.45145079 0.5155694 0.4110872 0.2193689
## 6 2.8200037 -0.59302518 -0.44706622 0.4545443 0.7041540 -0.1431837
## PC7 PC8 Annual_Income
## 1 -1.19400314 -0.18413220 1839485
## 2 0.13067386 -0.52625933 1407900
## 3 -0.41993545 1.24339543 912475
## 4 1.46893381 0.91780031 597493
## 5 0.22624107 -0.53556678 1630998
## 6 0.01013493 -0.08170745 590254
# Applying PCA, predict test data without predicted feature
test_pca <- predict(preProc, testQ[, -3])
# Adding column that needs to be predicted
test_pca$Annual_Income <- testQ$Annual_Income
head(test_pca)
## PC1 PC2 PC3 PC4 PC5 PC6
## 1 2.2459600 -1.8836049 -1.0282400 0.463054216 -0.853784239 -0.218438214
## 2 1.4532355 0.4486031 -1.6191810 -0.177268346 1.853533874 -0.460450740
## 3 -2.7134421 -0.7239373 -0.3312994 -2.223264592 0.145007336 0.004597723
## 4 0.8443699 -0.2447363 0.4355276 -0.389615315 -1.701711222 0.093582695
## 5 0.1342537 0.2137294 0.4170945 0.002490191 -0.003199088 -0.993774020
## 6 0.7799616 -2.9695812 -0.7796513 0.097079307 -1.140649816 1.294890401
## PC7 PC8 Annual_Income
## 1 0.91988009 -1.0360133 653904
## 2 0.94072197 -0.3827557 929575
## 3 0.02907999 -0.7162236 1143363
## 4 -0.88834340 -0.5001129 1188127
## 5 -0.09279641 0.7185523 1398096
## 6 -0.23816557 0.9481315 1045000
# Fitting linear model using components
fit <- lm(Annual_Income~., data=train_pca)
print(fit$coefficients)
## (Intercept) PC1 PC2 PC3 PC4 PC5
## 1231675.86 -134713.29 73955.85 -35581.26 35790.16 -11429.15
## PC6 PC7 PC8
## -73420.25 -159213.59 -91150.09
summary(fit)
##
## Call:
## lm(formula = Annual_Income ~ ., data = train_pca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -755216 -303966 -80171 215266 1621992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1231676 15225 80.898 < 2e-16 ***
## PC1 -134713 9186 -14.665 < 2e-16 ***
## PC2 73956 13777 5.368 1.07e-07 ***
## PC3 -35581 14334 -2.482 0.0133 *
## PC4 35790 15162 2.360 0.0185 *
## PC5 -11429 16918 -0.676 0.4995
## PC6 -73420 17992 -4.081 4.98e-05 ***
## PC7 -159214 19058 -8.354 3.24e-16 ***
## PC8 -91150 20975 -4.346 1.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 417000 on 741 degrees of freedom
## Multiple R-squared: 0.3278, Adjusted R-squared: 0.3206
## F-statistic: 45.17 on 8 and 741 DF, p-value: < 2.2e-16
# Predicting on test set
predictions <- predict(fit, test_pca)
head(predictions)
## 1 2 3 4 5 6
## 816741.6 1018085.2 1534549.3 1269987.2 1236923.1 807664.6
# Evaluation, root mean square error
rmse <- RMSE(predictions, testQ$Annual_Income)
rmse
## [1] 402142.3
error_rate <- rmse/mean(testQ$Annual_Income)
error_rate
## [1] 0.3246781
Judging from the p-values, all principal components except PC5 are statistically significant. Prediction with PCA gives an RMSE of 402,142, about 32.5% of the mean annual income. The RMSE looks large in absolute terms because annual income itself is measured in large units. R² is about 0.33 (adjusted R² 0.32).
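For reference, the RMSE and the normalized error reported above are simple to compute by hand; a sketch with made-up numbers:

```r
# RMSE: root of the mean squared prediction error
rmse_manual <- function(pred, actual) sqrt(mean((pred - actual)^2))

pred   <- c(100, 200, 300)
actual <- c(110, 190, 330)
rmse_manual(pred, actual)                   # sqrt(1100/3), about 19.15
rmse_manual(pred, actual) / mean(actual)    # normalized error, as in the text
```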
# Take a subset of quantitative variables to illustrate the concepts more easily.
# Use only 300 observations because the full data set is too big to visualize.
bankloan_sub <- bankloan[1:300, c('Annual_Income', 'Monthly_Debt', 'Months_since_last_delinquent', 'Maximum_Open_Credit')]
Status_sub <- bankloan[1:300, 'Loan_Status']
# Summarizing the data
summary(bankloan_sub)
## Annual_Income Monthly_Debt Months_since_last_delinquent
## Min. : 371564 Min. : 1006 Min. : 2.00
## 1st Qu.: 874385 1st Qu.: 9742 1st Qu.:19.00
## Median :1199594 Median :14782 Median :36.00
## Mean :1271442 Mean :16435 Mean :37.65
## 3rd Qu.:1580159 3rd Qu.:22311 3rd Qu.:54.00
## Max. :2913270 Max. :42775 Max. :81.00
## Maximum_Open_Credit
## Min. : 47432
## 1st Qu.: 229515
## Median : 377564
## Mean : 446304
## 3rd Qu.: 613261
## Max. :1252878
# Scale the data
bankloan_scaled <- scale(bankloan_sub)
summary(bankloan_scaled)
## Annual_Income Monthly_Debt Months_since_last_delinquent
## Min. :-1.7116 Min. :-1.7341 Min. :-1.67735
## 1st Qu.:-0.7552 1st Qu.:-0.7522 1st Qu.:-0.87749
## Median :-0.1367 Median :-0.1858 Median :-0.07763
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.5872 3rd Qu.: 0.6605 3rd Qu.: 0.76928
## Max. : 3.1229 Max. : 2.9606 Max. : 2.03964
## Maximum_Open_Credit
## Min. :-1.4138
## 1st Qu.:-0.7684
## Median :-0.2436
## Mean : 0.0000
## 3rd Qu.: 0.5918
## Max. : 2.8588
# Creating and printing Principal Components using function prcomp
# habillage: an optional factor variable for coloring the observations by groups
fviz_pca_ind(prcomp(bankloan_scaled), title='Bank Loan Status Data', habillage=Status_sub, palette='jco', geom='point', ggtheme=theme_classic(), legend='bottom')
# Counting variation captured by each Principal Component
PCA_bankloan <- prcomp(bankloan_scaled)
summary(PCA_bankloan)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.3602 1.0258 0.8073 0.6678
## Proportion of Variance 0.4625 0.2631 0.1629 0.1115
## Cumulative Proportion 0.4625 0.7256 0.8885 1.0000
print(PCA_bankloan) # same as rotation/direction
## Standard deviations (1, .., p=4):
## [1] 1.3601776 1.0257757 0.8073244 0.6677787
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Annual_Income 0.58744397 -0.1669828 0.5055315 -0.60947866
## Monthly_Debt 0.62100519 -0.1107894 0.1577300 0.75973647
## Months_since_last_delinquent 0.01269062 0.9330071 0.3558844 0.05179782
## Maximum_Open_Credit 0.51874954 0.2988983 -0.7700035 -0.22057480
PCA_bankloan$center # at the origin
## Annual_Income Monthly_Debt
## 1.390092e-16 2.575486e-17
## Months_since_last_delinquent Maximum_Open_Credit
## 7.039508e-17 8.977194e-17
PCA_bankloan$rotation
## PC1 PC2 PC3 PC4
## Annual_Income 0.58744397 -0.1669828 0.5055315 -0.60947866
## Monthly_Debt 0.62100519 -0.1107894 0.1577300 0.75973647
## Months_since_last_delinquent 0.01269062 0.9330071 0.3558844 0.05179782
## Maximum_Open_Credit 0.51874954 0.2988983 -0.7700035 -0.22057480
# Visualization, Counting variation captured by each Principal Component
eigen_exp <- fviz_eig(PCA_bankloan)
eigen_exp
# Visualization, Principal Component as linear combination of each of the variables
eigen_contri1 <- fviz_contrib(PCA_bankloan, choice='var', axes=1)
eigen_contri1
eigen_contri2 <- fviz_contrib(PCA_bankloan, choice='var', axes=2)
eigen_contri2
eigen_contri1_2 <- fviz_contrib(PCA_bankloan, choice='var', axes=1:2)
eigen_contri1_2
# Quantitative
var <- get_pca_var(PCA_bankloan)
head(var$contrib)
## Dim.1 Dim.2 Dim.3 Dim.4
## Annual_Income 34.50904123 2.788325 25.556209 37.1464239
## Monthly_Debt 38.56474505 1.227429 2.487876 57.7199503
## Months_since_last_delinquent 0.01610518 87.050224 12.665370 0.2683014
## Maximum_Open_Credit 26.91010853 8.934022 59.290545 4.8653244
In the PCA plot, the Charged Off and Fully Paid groups overlap heavily. With only four predictor features and the original unbalanced data, PC1 accounts for 46.3% of the variation and PC2 for 26.3%. Monthly_Debt and Annual_Income each contribute more than 25% to PC1, while Months_since_last_delinquent contributes about 87% of PC2.
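The contribution percentages from get_pca_var follow directly from the loadings: each column of `prcomp()$rotation` is a unit vector, so squaring it and scaling to 100 gives each variable's percent contribution to that component (0.933^2 reproduces the 87.05% Months_since_last_delinquent figure). A sketch on a built-in dataset:

```r
# Percent contribution of each variable to each principal component
pca <- prcomp(scale(mtcars[, c("mpg", "disp", "hp", "wt")]))
contrib <- 100 * pca$rotation^2
colSums(contrib)   # every column sums to 100
```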
# Shuffling data
train <- train[sample(nrow(train)), ]
# Use the downsampled, balanced data (train). Take a subset of quantitative variables to illustrate the concepts.
bankloan_sub <- train[1:300, c('Annual_Income', 'Monthly_Debt', 'Months_since_last_delinquent', 'Maximum_Open_Credit')]
Status_sub <- train[1:300, 'Loan_Status']
# Summarizing the data
summary(bankloan_sub)
## Annual_Income Monthly_Debt Months_since_last_delinquent
## Min. : 284354 Min. : 280.6 Min. : 1.00
## 1st Qu.: 855508 1st Qu.:10369.1 1st Qu.:18.00
## Median :1127289 Median :15891.0 Median :35.00
## Mean :1218081 Mean :16610.6 Mean :36.74
## 3rd Qu.:1536758 3rd Qu.:21993.4 3rd Qu.:54.00
## Max. :2880210 Max. :42506.2 Max. :83.00
## Maximum_Open_Credit
## Min. : 0
## 1st Qu.: 240432
## Median : 375100
## Mean : 437638
## 3rd Qu.: 587950
## Max. :1212134
# Scale the data
bankloan_scaled <- scale(bankloan_sub)
summary(bankloan_scaled)
## Annual_Income Monthly_Debt Months_since_last_delinquent
## Min. :-1.8834 Min. :-1.89061 Min. :-1.61011
## 1st Qu.:-0.7313 1st Qu.:-0.72261 1st Qu.:-0.84418
## Median :-0.1831 Median :-0.08331 Median :-0.07825
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.6428 3rd Qu.: 0.62319 3rd Qu.: 0.77780
## Max. : 3.3527 Max. : 2.99808 Max. : 2.08439
## Maximum_Open_Credit
## Min. :-1.6458
## 1st Qu.:-0.7416
## Median :-0.2352
## Mean : 0.0000
## 3rd Qu.: 0.5653
## Max. : 2.9126
# Creating and printing Principal Components using function prcomp
# habillage: an optional factor variable for coloring the observations by groups
fviz_pca_ind(prcomp(bankloan_scaled), title='Bank Loan Status Data', habillage=Status_sub, palette='jco', geom='point', ggtheme=theme_classic(), legend='bottom')
# Counting variation captured by each Principal Component
PCA_bankloan <- prcomp(bankloan_scaled)
summary(PCA_bankloan)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.3417 1.0074 0.8657 0.6598
## Proportion of Variance 0.4501 0.2537 0.1874 0.1089
## Cumulative Proportion 0.4501 0.7038 0.8911 1.0000
print(PCA_bankloan) # same as rotation/direction
## Standard deviations (1, .., p=4):
## [1] 1.3417459 1.0074132 0.8657078 0.6598384
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Annual_Income 0.6100994 -0.069384108 -0.4138129 -0.67210379
## Monthly_Debt 0.6311028 0.005781102 -0.2583915 0.73137519
## Months_since_last_delinquent -0.1224274 0.940644848 -0.3162529 -0.01352347
## Maximum_Open_Credit 0.4631410 0.332174192 0.8136196 -0.11482137
PCA_bankloan$center # at the origin
## Annual_Income Monthly_Debt
## 1.694594e-16 1.510019e-16
## Months_since_last_delinquent Maximum_Open_Credit
## 9.095155e-17 5.027807e-17
PCA_bankloan$rotation
## PC1 PC2 PC3 PC4
## Annual_Income 0.6100994 -0.069384108 -0.4138129 -0.67210379
## Monthly_Debt 0.6311028 0.005781102 -0.2583915 0.73137519
## Months_since_last_delinquent -0.1224274 0.940644848 -0.3162529 -0.01352347
## Maximum_Open_Credit 0.4631410 0.332174192 0.8136196 -0.11482137
# Visualization, Counting variation captured by each Principal Component
eigen_exp <- fviz_eig(PCA_bankloan)
eigen_exp
# Visualization, Principal Component as linear combination of each of the variables
eigen_contri1 <- fviz_contrib(PCA_bankloan, choice='var', axes=1)
eigen_contri1
eigen_contri2 <- fviz_contrib(PCA_bankloan, choice='var', axes=2)
eigen_contri2
eigen_contri1_2 <- fviz_contrib(PCA_bankloan, choice='var', axes=1:2)
eigen_contri1_2
# Quantitative
var <- get_pca_var(PCA_bankloan)
head(var$contrib)
## Dim.1 Dim.2 Dim.3 Dim.4
## Annual_Income 37.222123 0.481415442 17.124112 45.17235024
## Monthly_Debt 39.829074 0.003342114 6.676617 53.49096664
## Months_since_last_delinquent 1.498848 88.481273071 10.001591 0.01828842
## Maximum_Open_Credit 21.449956 11.033969373 66.197680 1.31839471
With the balanced data, Charged Off and Fully Paid still overlap in the PCA plot. PC1 accounts for 45.0% of the variation and PC2 for 25.4%, slightly less than with the original unbalanced data, which is surprising. Monthly_Debt and Annual_Income again each contribute more than 25% to PC1.
# Using balanced data. Taking all quantitative variables
bankloan_sub <- train[1:300, c(2,4,5,6,9,10,11,12,14,15)]
Status_sub <- train[1:300, 'Loan_Status']
# Summarizing the data
summary(bankloan_sub)
## Current_Loan_Amount Credit_Score Annual_Income Years_in_current_job
## Min. : 31108 Min. :648.0 Min. : 284354 Min. : 1.000
## 1st Qu.:177452 1st Qu.:696.8 1st Qu.: 855508 1st Qu.: 2.750
## Median :271865 Median :716.0 Median :1127289 Median : 6.000
## Mean :305396 Mean :712.2 Mean :1218081 Mean : 5.987
## 3rd Qu.:394735 3rd Qu.:733.0 3rd Qu.:1536758 3rd Qu.:10.000
## Max. :786478 Max. :749.0 Max. :2880210 Max. :10.000
## Monthly_Debt Years_of_Credit_History Months_since_last_delinquent
## Min. : 280.6 Min. : 7.90 Min. : 1.00
## 1st Qu.:10369.1 1st Qu.:14.50 1st Qu.:18.00
## Median :15891.0 Median :17.00 Median :35.00
## Mean :16610.6 Mean :18.45 Mean :36.74
## 3rd Qu.:21993.4 3rd Qu.:21.30 3rd Qu.:54.00
## Max. :42506.2 Max. :46.10 Max. :83.00
## Number_of_Open_Accounts Current_Credit_Balance Maximum_Open_Credit
## Min. : 2.00 Min. : 0 Min. : 0
## 1st Qu.: 8.00 1st Qu.:109858 1st Qu.: 240432
## Median :10.00 Median :180405 Median : 375100
## Mean :10.92 Mean :208824 Mean : 437638
## 3rd Qu.:13.00 3rd Qu.:288600 3rd Qu.: 587950
## Max. :32.00 Max. :842593 Max. :1212134
# Scale the data
bankloan_scaled <- scale(bankloan_sub)
summary(bankloan_scaled)
## Current_Loan_Amount Credit_Score Annual_Income Years_in_current_job
## Min. :-1.5968 Min. :-2.6124 Min. :-1.8834 Min. :-1.418733
## 1st Qu.:-0.7449 1st Qu.:-0.6285 1st Qu.:-0.7313 1st Qu.:-0.920849
## Median :-0.1952 Median : 0.1549 Median :-0.1831 Median : 0.003793
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 0.5201 3rd Qu.: 0.8468 3rd Qu.: 0.6428 3rd Qu.: 1.141814
## Max. : 2.8008 Max. : 1.4979 Max. : 3.3527 Max. : 1.141814
## Monthly_Debt Years_of_Credit_History Months_since_last_delinquent
## Min. :-1.89061 Min. :-1.7619 Min. :-1.61011
## 1st Qu.:-0.72261 1st Qu.:-0.6594 1st Qu.:-0.84418
## Median :-0.08331 Median :-0.2417 Median :-0.07825
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.62319 3rd Qu.: 0.4766 3rd Qu.: 0.77780
## Max. : 2.99808 Max. : 4.6195 Max. : 2.08439
## Number_of_Open_Accounts Current_Credit_Balance Maximum_Open_Credit
## Min. :-2.0487 Min. :-1.4423 Min. :-1.6458
## 1st Qu.:-0.6707 1st Qu.:-0.6835 1st Qu.:-0.7416
## Median :-0.2113 Median :-0.1963 Median :-0.2352
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4777 3rd Qu.: 0.5510 3rd Qu.: 0.5653
## Max. : 4.8416 Max. : 4.3772 Max. : 2.9126
# Creating and printing Principal Components using function prcomp
# habillage: an optional factor variable for coloring the observations by groups
fviz_pca_ind(prcomp(bankloan_scaled), title='Bank Loan Status Data', habillage=Status_sub, palette='jco', geom='point', ggtheme=theme_classic(), legend='bottom')
# Counting variation captured by each Principal Component
PCA_bankloan <- prcomp(bankloan_scaled)
summary(PCA_bankloan)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7862 1.1417 1.0337 0.97749 0.95933 0.89286 0.87047
## Proportion of Variance 0.3191 0.1303 0.1069 0.09555 0.09203 0.07972 0.07577
## Cumulative Proportion 0.3191 0.4494 0.5563 0.65181 0.74384 0.82356 0.89933
## PC8 PC9 PC10
## Standard deviation 0.65689 0.62685 0.42689
## Proportion of Variance 0.04315 0.03929 0.01822
## Cumulative Proportion 0.94248 0.98178 1.00000
print(PCA_bankloan) # same as rotation/direction
## Standard deviations (1, .., p=10):
## [1] 1.7862493 1.1416702 1.0336827 0.9774934 0.9593270 0.8928578 0.8704715
## [8] 0.6568943 0.6268464 0.4268943
##
## Rotation (n x k) = (10 x 10):
## PC1 PC2 PC3 PC4
## Current_Loan_Amount 0.40701343 -0.1952349 0.23488119 0.133885112
## Credit_Score -0.19309394 0.2471000 -0.70633231 -0.177892866
## Annual_Income 0.33934497 -0.2673582 -0.35350511 0.373712612
## Years_in_current_job 0.16912414 -0.3672567 -0.06053996 -0.057292875
## Monthly_Debt 0.40576503 -0.0664915 -0.22989823 0.384267013
## Years_of_Credit_History 0.20183701 -0.3025326 0.10877495 -0.661801819
## Months_since_last_delinquent -0.03719702 0.4276409 0.45420744 0.310033556
## Number_of_Open_Accounts 0.26241882 0.4694794 -0.16585772 -0.006102478
## Current_Credit_Balance 0.45266551 0.2292062 0.12257523 -0.220384525
## Maximum_Open_Credit 0.41564415 0.3777523 -0.06141242 -0.277696137
## PC5 PC6 PC7 PC8
## Current_Loan_Amount -0.05175622 0.05069197 -0.328710079 -0.71398603
## Credit_Score 0.30859806 -0.23513688 -0.234633903 -0.23517423
## Annual_Income 0.05442396 -0.38245289 0.009252672 -0.08505729
## Years_in_current_job 0.74958516 0.49747815 0.093894358 0.10469261
## Monthly_Debt -0.11325493 -0.11758078 0.223550562 0.39811129
## Years_of_Credit_History 0.04781816 -0.48510199 0.404106290 -0.07577986
## Months_since_last_delinquent 0.55571560 -0.43131846 0.120436130 -0.03961992
## Number_of_Open_Accounts -0.09853217 0.32660982 0.641716849 -0.33484611
## Current_Credit_Balance -0.01344433 -0.02000199 -0.315681161 0.36483591
## Maximum_Open_Credit 0.05853400 0.07827771 -0.297923035 0.07542986
## PC9 PC10
## Current_Loan_Amount -0.316310269 0.001445027
## Credit_Score -0.313783047 0.108458262
## Annual_Income 0.610180369 0.140943219
## Years_in_current_job -0.001624696 0.020741613
## Monthly_Debt -0.577299394 -0.249048829
## Years_of_Credit_History -0.087639247 -0.059888041
## Months_since_last_delinquent -0.039218127 -0.029091854
## Number_of_Open_Accounts 0.080439561 0.190798187
## Current_Credit_Balance -0.054694992 0.665306097
## Maximum_Open_Credit 0.277866761 -0.649957790
PCA_bankloan$center # column means used for centering (near zero because the data were already scaled)
## Current_Loan_Amount Credit_Score
## 8.777701e-17 -2.095060e-15
## Annual_Income Years_in_current_job
## 1.694594e-16 1.250533e-16
## Monthly_Debt Years_of_Credit_History
## 1.510019e-16 1.277855e-16
## Months_since_last_delinquent Number_of_Open_Accounts
## 9.095155e-17 1.324172e-17
## Current_Credit_Balance Maximum_Open_Credit
## 3.416249e-17 5.027807e-17
PCA_bankloan$rotation
## PC1 PC2 PC3 PC4
## Current_Loan_Amount 0.40701343 -0.1952349 0.23488119 0.133885112
## Credit_Score -0.19309394 0.2471000 -0.70633231 -0.177892866
## Annual_Income 0.33934497 -0.2673582 -0.35350511 0.373712612
## Years_in_current_job 0.16912414 -0.3672567 -0.06053996 -0.057292875
## Monthly_Debt 0.40576503 -0.0664915 -0.22989823 0.384267013
## Years_of_Credit_History 0.20183701 -0.3025326 0.10877495 -0.661801819
## Months_since_last_delinquent -0.03719702 0.4276409 0.45420744 0.310033556
## Number_of_Open_Accounts 0.26241882 0.4694794 -0.16585772 -0.006102478
## Current_Credit_Balance 0.45266551 0.2292062 0.12257523 -0.220384525
## Maximum_Open_Credit 0.41564415 0.3777523 -0.06141242 -0.277696137
## PC5 PC6 PC7 PC8
## Current_Loan_Amount -0.05175622 0.05069197 -0.328710079 -0.71398603
## Credit_Score 0.30859806 -0.23513688 -0.234633903 -0.23517423
## Annual_Income 0.05442396 -0.38245289 0.009252672 -0.08505729
## Years_in_current_job 0.74958516 0.49747815 0.093894358 0.10469261
## Monthly_Debt -0.11325493 -0.11758078 0.223550562 0.39811129
## Years_of_Credit_History 0.04781816 -0.48510199 0.404106290 -0.07577986
## Months_since_last_delinquent 0.55571560 -0.43131846 0.120436130 -0.03961992
## Number_of_Open_Accounts -0.09853217 0.32660982 0.641716849 -0.33484611
## Current_Credit_Balance -0.01344433 -0.02000199 -0.315681161 0.36483591
## Maximum_Open_Credit 0.05853400 0.07827771 -0.297923035 0.07542986
## PC9 PC10
## Current_Loan_Amount -0.316310269 0.001445027
## Credit_Score -0.313783047 0.108458262
## Annual_Income 0.610180369 0.140943219
## Years_in_current_job -0.001624696 0.020741613
## Monthly_Debt -0.577299394 -0.249048829
## Years_of_Credit_History -0.087639247 -0.059888041
## Months_since_last_delinquent -0.039218127 -0.029091854
## Number_of_Open_Accounts 0.080439561 0.190798187
## Current_Credit_Balance -0.054694992 0.665306097
## Maximum_Open_Credit 0.277866761 -0.649957790
# Visualization, Counting variation captured by each Principal Component
eigen_exp <- fviz_eig(PCA_bankloan)
eigen_exp
# Visualization, Principal Component as linear combination of each of the variables
eigen_contri1 <- fviz_contrib(PCA_bankloan, choice='var', axes=1)
eigen_contri1
eigen_contri2 <- fviz_contrib(PCA_bankloan, choice='var', axes=2)
eigen_contri2
eigen_contri1_2 <- fviz_contrib(PCA_bankloan, choice='var', axes=1:2)
eigen_contri1_2
# Quantitative
var <- get_pca_var(PCA_bankloan)
head(var$contrib)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Current_Loan_Amount 16.565993 3.8116650 5.5169175 1.7925223 0.2678706
## Credit_Score 3.728527 6.1058423 49.8905338 3.1645872 9.5232763
## Annual_Income 11.515501 7.1480404 12.4965862 13.9661116 0.2961967
## Years_in_current_job 2.860297 13.4877493 0.3665087 0.3282474 56.1877916
## Monthly_Debt 16.464526 0.4421119 5.2853197 14.7661137 1.2826679
## Years_of_Credit_History 4.073818 9.1525972 1.1831989 43.7981648 0.2286577
## Dim.6 Dim.7 Dim.8 Dim.9
## Current_Loan_Amount 0.2569676 10.805031600 50.9776050 1.000522e+01
## Credit_Score 5.5289353 5.505306831 5.5306919 9.845980e+00
## Annual_Income 14.6270215 0.008561193 0.7234742 3.723201e+01
## Years_in_current_job 24.7484510 0.881615052 1.0960542 2.639638e-04
## Monthly_Debt 1.3825241 4.997485369 15.8492601 3.332746e+01
## Years_of_Credit_History 23.5323940 16.330189373 0.5742587 7.680638e-01
## Dim.10
## Current_Loan_Amount 0.0002088102
## Credit_Score 1.1763194542
## Annual_Income 1.9864990910
## Years_in_current_job 0.0430214491
## Monthly_Debt 6.2025319326
## Years_of_Credit_History 0.3586577477
With all quantitative variables included, PC1-PC6 are needed to account for more than 80% of the variation.
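The cumulative proportions above come straight from the prcomp() standard deviations, so the "how many PCs for >80%" rule can be computed rather than read off. A minimal sketch on synthetic data standing in for bankloan_scaled:

```r
set.seed(1)
# Synthetic stand-in for the scaled loan data: 300 rows, 10 numeric columns
X <- matrix(rnorm(300 * 10), ncol = 10)
pca <- prcomp(X)

# Proportion of variance per component = sdev^2 / sum(sdev^2)
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components whose cumulative proportion exceeds 80%
n_pc <- which(cum_var > 0.80)[1]
cat("Components needed for >80% variance:", n_pc, "\n")
```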
# Using balanced data
# Taking only 300 observations because the full data are too large to visualize
bankloan_sub <- train[1:300, c('Annual_Income', 'Monthly_Debt', 'Months_since_last_delinquent', 'Maximum_Open_Credit')] # 'Maximum_Open_Credit', 'Credit_Score',
# Scale the data
bankloan_scaled <- scale(bankloan_sub)
#summary(bankloan_scaled)
# Elbow Method
elbow_figure <- fviz_nbclust(bankloan_scaled, kmeans, method='wss')
elbow_figure
# Silhouette Score
sil <- fviz_nbclust(bankloan_scaled, kmeans, method='silhouette')
sil
# Gap Statistics
gap_stat <- clusGap(bankloan_scaled, FUN=kmeans, nstart=25, K.max=10, B=50)
fviz_gap_stat(gap_stat)
# Applying kMeans
kmeans_clust <- kmeans(bankloan_scaled, 2)
# Visualize Cluster Plot
fviz_cluster(list(data=bankloan_scaled, cluster=kmeans_clust$cluster), ellipse.type='norm', geom='point', stand=FALSE, palette='jco')
# Calculating the Hopkins statistic, which indicates whether the data have an inherent clustering tendency
print(hopkins(bankloan_scaled, n=nrow(bankloan_scaled)-1))
## $H
## [1] 0.3729719
# Visualizing the dissimilarity matrix
# Pink/red is high in similarity, blue/purple is low in similarity.
fviz_dist(dist(bankloan_scaled), show_labels = FALSE) +
labs(title='Bank Loan Status Dataset')
The Hopkins statistic is 0.37. This indicates the data do not separate well into clusters but are closer to random, which agrees with the PCA plot in the previous section. However, supervised algorithms like k-Nearest Neighbors and Random Forest give 98%-100% accuracy. This data probably works well for supervised algorithms but not so well for unsupervised algorithms like PCA and k-Means. Note also that categorical variables were included in the supervised algorithms but not here.
In terms of predicting the number of clusters, both the Elbow Method and the Silhouette Score suggest 2 is optimal, while the Gap Statistic suggests 1, which would mean all observations form a single group and does not make sense here.
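What fviz_nbclust(..., method='silhouette') computes can be sketched by hand with stats::kmeans and cluster::silhouette. On two well-separated synthetic blobs (a stand-in, since the real data separate poorly), the average silhouette width peaks at k = 2:

```r
library(cluster)  # silhouette() comes from the recommended 'cluster' package

set.seed(42)
# Two clearly separated 2-D blobs, 50 points each
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))

# Average silhouette width for each candidate k
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(X, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(X))[, "sil_width"])
})
best_k <- (2:6)[which.max(avg_sil)]
cat("Best k by average silhouette:", best_k, "\n")
```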
# Taking a subset of quantitative variables to illustrate the concepts more easily. Taking only 300 observations of the balanced data because the full data are too large to visualize
bankloan_sub <- train[1:300, c('Annual_Income', 'Monthly_Debt', 'Months_since_last_delinquent', 'Maximum_Open_Credit')] # 'Maximum_Open_Credit', 'Credit_Score',
# Scale the data
bankloan_scaled <- scale(bankloan_sub)
#summary(bankloan_scaled)
# Creating dissimilarity matrix using Euclidean distance
bankloan_dist <- dist(bankloan_scaled, method='euclidean')
# The linkage function uses the distances as a proximity metric and merges instances pairwise, creating larger clusters with each successive iteration.
# Using the ward.D2 linkage, which creates clusters by minimizing within-cluster variance.
# https://en.wikipedia.org/wiki/Ward%27s_method
agg_tree_ward <- hclust(d=bankloan_dist, method='ward.D2')
print(agg_tree_ward)
##
## Call:
## hclust(d = bankloan_dist, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 300
# Visualizing Dendrogram
fviz_dend(agg_tree_ward, cex=.5)
# Cutting tree to create 2 clusters and visualizing it
agg_tree_ward_dend <- fviz_dend(agg_tree_ward, cex=.5, k=2, palette='jco')
agg_tree_ward_dend
# To assess how well the cluster tree (created by hclust()) partitions the data, there should be a strong correlation between the original distance matrix and the linkage distances on the tree, known as the cophenetic distances.
# The cophenetic correlation coefficient measures how faithfully a dendrogram preserves the pairwise distances between the original data points.
agg_cophenetic <- cophenetic(agg_tree_ward)
cor(bankloan_dist, agg_cophenetic)
## [1] 0.5038476
Using the Ward linkage function, I get a cophenetic correlation coefficient of 0.50. I used balanced data, yet the dendrogram shows one cluster roughly twice as large as the other.
agg_tree_avg <- hclust(d=bankloan_dist, method='average')
print(agg_tree_avg)
##
## Call:
## hclust(d = bankloan_dist, method = "average")
##
## Cluster method : average
## Distance : euclidean
## Number of objects: 300
# Visualizing Dendrogram
fviz_dend(agg_tree_avg, cex=.5)
# Cutting tree to create 2 clusters and visualizing it
agg_tree_avg_dend <- fviz_dend(agg_tree_avg, cex=.5, k=2, palette='jco')
agg_tree_avg_dend
# Cophenetic Distance
agg_cophenetic <- cophenetic(agg_tree_avg)
# Correlation between Cophenetic distances and original distances
cor(bankloan_dist, agg_cophenetic)
## [1] 0.6559579
# Cutting tree into two clusters and Visualizing it
two_groups <- cutree(agg_tree_avg, k=2)
fviz_cluster(list(data=bankloan_scaled, cluster=two_groups))
Using the Average linkage function, I get a cophenetic correlation coefficient of 0.65, higher than with Ward linkage. However, the dendrogram and cluster plot show one cluster much larger than the other, which should not happen since I used balanced data. Therefore, the Average linkage function does not work well here.
# Using the Complete (i.e. Maximum) linkage
agg_tree_max <- hclust(d=bankloan_dist, method='complete')
print(agg_tree_max)
##
## Call:
## hclust(d = bankloan_dist, method = "complete")
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 300
# Visualizing Dendrogram
fviz_dend(agg_tree_max, cex=.5)
# Cutting tree to create 2 clusters and visualizing it
agg_tree_max_dend <- fviz_dend(agg_tree_max, cex=.5, k=2, palette='jco')
agg_tree_max_dend
# Cophenetic Distance
agg_cophenetic <- cophenetic(agg_tree_max)
# Correlation between Cophenetic distances and original distances
cor(bankloan_dist, agg_cophenetic)
## [1] 0.5799617
Similar to Average linkage, the Complete (Maximum) linkage dendrogram shows one cluster much larger than the other, which again should not happen with balanced data. Therefore, the Ward linkage function works best for my data.
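The three linkage runs above can be condensed into one loop that computes each cophenetic correlation in a single pass; a sketch on synthetic data standing in for the scaled subset:

```r
set.seed(7)
X <- matrix(rnorm(300 * 4), ncol = 4)   # stand-in for bankloan_scaled
d <- dist(X, method = "euclidean")

# Cophenetic correlation for each linkage method
methods <- c("ward.D2", "average", "complete")
coph_cor <- sapply(methods, function(m) {
  tree <- hclust(d, method = m)
  cor(d, cophenetic(tree))
})
print(round(coph_cor, 3))
```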
# Agglomerative (bottom-up approach)
agnes_cluster <- agnes(x=bankloan_scaled, stand=TRUE, metric='euclidean', method='average')
# Plot tree
#agnes_tree <- pltree(agnes_cluster, cex=.5, hang=-1, main='Dendrogram of Agnes')
fviz_dend(agnes_cluster, cex=.6, k=2, main='Dendrogram of Agnes')
# Divisive (top-down approach). The divisive approach does not need a linkage function.
diana_cluster <- diana(x=bankloan_scaled, stand=TRUE, metric='euclidean')
fviz_dend(diana_cluster, cex=.6, k=2, main='Dendrogram of Diana')
Using the cluster package, the divisive (top-down) method gives better clusters than the agglomerative (bottom-up) method, judging by the size balance of the two clusters.
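The size-balance argument can be made quantitative by cutting both trees into two groups and tabulating the cluster sizes. A sketch with the same cluster-package functions on synthetic data:

```r
library(cluster)

set.seed(5)
X <- matrix(rnorm(100 * 4), ncol = 4)   # stand-in for the scaled subset

ag <- agnes(X, metric = "euclidean", method = "average")  # bottom-up
di <- diana(X, metric = "euclidean")                      # top-down

# Both objects convert to hclust trees, which cutree() can cut into k groups
sizes_ag <- table(cutree(as.hclust(ag), k = 2))
sizes_di <- table(cutree(as.hclust(di), k = 2))
print(sizes_ag)
print(sizes_di)
```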
library(pheatmap)
# str() prints the structure directly and returns NULL, so there is no table for kable() to render
str(bankloan_scaled)
## num [1:300, 1:4] -0.722 0.166 0.101 0.902 -0.549 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:300] "179" "526" "195" "118" ...
## ..$ : chr [1:4] "Annual_Income" "Monthly_Debt" "Months_since_last_delinquent" "Maximum_Open_Credit"
## - attr(*, "scaled:center")= Named num [1:4] 1.22e+06 1.66e+04 3.67e+01 4.38e+05
## ..- attr(*, "names")= chr [1:4] "Annual_Income" "Monthly_Debt" "Months_since_last_delinquent" "Maximum_Open_Credit"
## - attr(*, "scaled:scale")= Named num [1:4] 495765 8637.4 22.2 265914.4
## ..- attr(*, "names")= chr [1:4] "Annual_Income" "Monthly_Debt" "Months_since_last_delinquent" "Maximum_Open_Credit"
knitr::kable(summary(bankloan_scaled))
| Annual_Income | Monthly_Debt | Months_since_last_delinquent | Maximum_Open_Credit | |
|---|---|---|---|---|
| Min. :-1.8834 | Min. :-1.89061 | Min. :-1.61011 | Min. :-1.6458 | |
| 1st Qu.:-0.7313 | 1st Qu.:-0.72261 | 1st Qu.:-0.84418 | 1st Qu.:-0.7416 | |
| Median :-0.1831 | Median :-0.08331 | Median :-0.07825 | Median :-0.2352 | |
| Mean : 0.0000 | Mean : 0.00000 | Mean : 0.00000 | Mean : 0.0000 | |
| 3rd Qu.: 0.6428 | 3rd Qu.: 0.62319 | 3rd Qu.: 0.77780 | 3rd Qu.: 0.5653 | |
| Max. : 3.3527 | Max. : 2.99808 | Max. : 2.08439 | Max. : 2.9126 |
# Heatmaps are used to visualize hierarchical clustering of both samples and features.
# High values are shown in red and low values in blue.
# cellheight can be increased to see the details of each row (e.g., when rows represent genes)
pheatmap(bankloan_scaled, cutree_rows = 2, cellheight = .4)
clValid reports validation measures for clustering results. The function returns an object of class “clValid”, which contains the clustering results in addition to the validation measures. The validation measures fall into three general categories: “internal”, “stability”, and “biological”.
library(clValid)
bankloan_sub <- train[1:300, c('Annual_Income', 'Monthly_Debt', 'Months_since_last_delinquent', 'Maximum_Open_Credit')] # 'Maximum_Open_Credit', 'Credit_Score',
# Scale the data
bankloan_scaled <- scale(bankloan_sub)
knitr::kable(summary(bankloan_scaled))
| Annual_Income | Monthly_Debt | Months_since_last_delinquent | Maximum_Open_Credit | |
|---|---|---|---|---|
| Min. :-1.8834 | Min. :-1.89061 | Min. :-1.61011 | Min. :-1.6458 | |
| 1st Qu.:-0.7313 | 1st Qu.:-0.72261 | 1st Qu.:-0.84418 | 1st Qu.:-0.7416 | |
| Median :-0.1831 | Median :-0.08331 | Median :-0.07825 | Median :-0.2352 | |
| Mean : 0.0000 | Mean : 0.00000 | Mean : 0.00000 | Mean : 0.0000 | |
| 3rd Qu.: 0.6428 | 3rd Qu.: 0.62319 | 3rd Qu.: 0.77780 | 3rd Qu.: 0.5653 | |
| Max. : 3.3527 | Max. : 2.99808 | Max. : 2.08439 | Max. : 2.9126 |
# Comparing Hierarchical, k-Means and PAM (Partitioning Around Medoids) algorithms
clmethods <- c('hierarchical', 'kmeans', 'pam')
# Using 'internal' as evaluation metric
evaluation <- clValid(bankloan_scaled, nClust=2:6, clMethods=clmethods, validation='internal')
summary(evaluation)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## hierarchical Connectivity 9.0060 50.4532 53.1821 67.8536 69.2532
## Dunn 0.1456 0.0880 0.0880 0.1051 0.1051
## Silhouette 0.3627 0.2592 0.1871 0.1814 0.1733
## kmeans Connectivity 67.9052 113.3679 108.8123 131.1357 167.3623
## Dunn 0.0694 0.0516 0.0677 0.0763 0.0493
## Silhouette 0.2841 0.2388 0.2477 0.2478 0.2338
## pam Connectivity 78.1131 107.1226 159.7254 179.6643 205.6044
## Dunn 0.0683 0.0563 0.0513 0.0649 0.0624
## Silhouette 0.2552 0.2354 0.1921 0.2115 0.1873
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 9.0060 hierarchical 2
## Dunn 0.1456 hierarchical 2
## Silhouette 0.3627 hierarchical 2
optimalScores(evaluation)
## Score Method Clusters
## Connectivity 9.0059524 hierarchical 2
## Dunn 0.1456297 hierarchical 2
## Silhouette 0.3627021 hierarchical 2
# Visualization
plot(evaluation)
Comparing k-Means, Hierarchical, and PAM (Partitioning Around Medoids) algorithms, hierarchical clustering gives the best internal validation measures in terms of Connectivity (lower is better), the Dunn index (higher is better), and the Silhouette score (closer to 1 is better).
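The Dunn index reported by clValid is the smallest between-cluster distance divided by the largest within-cluster diameter. A base-R sketch of that definition on a toy partition (not the package's exact implementation):

```r
dunn_index <- function(X, labels) {
  d <- as.matrix(dist(X))
  ks <- unique(labels)
  # Largest diameter (max pairwise distance) within any cluster
  max_diam <- max(sapply(ks, function(k) {
    idx <- which(labels == k)
    if (length(idx) < 2) 0 else max(d[idx, idx])
  }))
  # Smallest distance between points in different clusters
  min_sep <- min(sapply(ks, function(k) {
    idx <- which(labels == k)
    min(d[idx, -idx, drop = FALSE])
  }))
  min_sep / max_diam
}

set.seed(3)
# Two well-separated blobs, so the index should be comfortably above zero
X <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))
labels <- rep(1:2, each = 20)
di <- dunn_index(X, labels)
cat("Dunn index:", di, "\n")
```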
# Using 'stability' as validation measure
evaluation <- clValid(bankloan_scaled, nClust=2:6, clMethods=clmethods, validation='stability')
summary(evaluation)
##
## Clustering Methods:
## hierarchical kmeans pam
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## hierarchical APN 0.0079 0.0960 0.2284 0.2976 0.3032
## AD 2.5664 2.4465 2.3417 2.2985 2.2746
## ADM 0.1023 0.7024 0.6147 0.7919 0.7977
## FOM 0.9811 0.9617 0.9399 0.9366 0.9338
## kmeans APN 0.1620 0.4120 0.3055 0.3207 0.4384
## AD 2.3116 2.2991 2.0601 1.9948 1.9841
## ADM 0.4041 0.9948 0.7263 0.7808 0.9347
## FOM 0.9414 0.9503 0.9350 0.9310 0.9298
## pam APN 0.2103 0.2950 0.4282 0.4625 0.4973
## AD 2.3402 2.1755 2.1488 2.0755 2.0263
## ADM 0.5056 0.6582 0.9623 1.0007 1.0218
## FOM 0.9445 0.9475 0.9359 0.9459 0.9343
##
## Optimal Scores:
##
## Score Method Clusters
## APN 0.0079 hierarchical 2
## AD 1.9841 kmeans 6
## ADM 0.1023 hierarchical 2
## FOM 0.9298 kmeans 6
optimalScores(evaluation)
## Score Method Clusters
## APN 0.007888128 hierarchical 2
## AD 1.984122062 kmeans 6
## ADM 0.102312826 hierarchical 2
## FOM 0.929843323 kmeans 6
# Visualization
plot(evaluation)
Comparing k-Means, Hierarchical, and PAM (Partitioning Around Medoids) algorithms, hierarchical clustering gives the best stability measures in terms of APN and ADM, and it selects 2 clusters as optimal. k-Means gives the best stability in terms of AD and FOM, selecting 6 clusters. Hierarchical clustering is the better choice here, since I know the data has only two classes.
set.seed(9650)
# Divide data into train and test sets
indexTrain <- createDataPartition(y=bankloan$Loan_Status, p=0.75, list=FALSE)
training <- bankloan[indexTrain, ]
testing <- bankloan[-indexTrain, ]
table(training$Loan_Status)
##
## Charged Off Fully Paid
## 3471 15418
# Downsampling with caret's downSample() to balance the classes
down_train <- downSample(x=training[, -ncol(training)], y=training$Loan_Status)
down_test <- downSample(x=testing[, -ncol(testing)], y=testing$Loan_Status)
table(down_train$Loan_Status)
##
## Charged Off Fully Paid
## 3471 3471
table(down_test$Loan_Status)
##
## Charged Off Fully Paid
## 1157 1157
# cross validation and fitting model
fitControl <- trainControl(method='repeatedcv', number=10)
down_kNNfit <- train(Loan_Status~., data=down_train, method='knn', trControl=fitControl, preProcess=c('BoxCox', "center", "scale"), tuneLength=12)
#print output
down_kNNfit
## k-Nearest Neighbors
##
## 6942 samples
## 16 predictor
## 2 classes: 'Charged Off', 'Fully Paid'
##
## Pre-processing: Box-Cox transformation (6), centered (32), scaled (32)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 6248, 6248, 6248, 6248, 6248, 6248, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9752236 0.9504474
## 7 0.9775274 0.9550549
## 9 0.9783928 0.9567858
## 11 0.9794000 0.9588000
## 13 0.9801202 0.9602406
## 15 0.9815605 0.9631212
## 17 0.9815603 0.9631207
## 19 0.9825686 0.9651372
## 21 0.9824251 0.9648502
## 23 0.9814163 0.9628325
## 25 0.9825692 0.9651384
## 27 0.9825696 0.9651393
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 27.
plot(down_kNNfit)
# Prediction
down_kNNpredict <- predict(down_kNNfit, newdata=down_test)
# confusion matrix to see accuracy and other parameter values
confusionMatrix(down_kNNpredict, down_test$Loan_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Fully Paid
## Charged Off 1129 19
## Fully Paid 28 1138
##
## Accuracy : 0.9797
## 95% CI : (0.9731, 0.985)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9594
##
## Mcnemar's Test P-Value : 0.2432
##
## Sensitivity : 0.9758
## Specificity : 0.9836
## Pos Pred Value : 0.9834
## Neg Pred Value : 0.9760
## Prevalence : 0.5000
## Detection Rate : 0.4879
## Detection Prevalence : 0.4961
## Balanced Accuracy : 0.9797
##
## 'Positive' Class : Charged Off
##
# calculate accuracy
mean(down_kNNpredict == down_test$Loan_Status)
## [1] 0.9796889
set.seed(9650)
# Number of variables randomly sampled as candidates at each split.
mtry = sqrt(ncol(training))
# setting the mtry value
tunegrid <- expand.grid(mtry=mtry)
# Cross Validation for parameter tuning
fitControl <- trainControl(method='repeatedcv', number=10, repeats=1)
# train the model
# Note: caret's argument name is tuneGrid (capital G); 'tunegrid' below is silently
# ignored, so caret tunes over its own default mtry grid, as the output shows.
RFfit_down <- train(Loan_Status~., data=down_train, method='rf', ntree=100, tunegrid=tunegrid, trControl=fitControl, preProcess=c('BoxCox', "center", "scale"))
#print output
RFfit_down
## Random Forest
##
## 6942 samples
## 16 predictor
## 2 classes: 'Charged Off', 'Fully Paid'
##
## Pre-processing: Box-Cox transformation (6), centered (32), scaled (32)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 6248, 6248, 6248, 6248, 6248, 6247, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9989918 0.9979835
## 17 1.0000000 1.0000000
## 32 1.0000000 1.0000000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 17.
plot(RFfit_down)
# Prediction
RFpredict_down <- predict(RFfit_down, newdata=down_test)
# confusion matrix to see accuracy and other parameter values
confusionMatrix(RFpredict_down, down_test$Loan_Status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Charged Off Fully Paid
## Charged Off 1157 0
## Fully Paid 0 1157
##
## Accuracy : 1
## 95% CI : (0.9984, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0
## Specificity : 1.0
## Pos Pred Value : 1.0
## Neg Pred Value : 1.0
## Prevalence : 0.5
## Detection Rate : 0.5
## Detection Prevalence : 0.5
## Balanced Accuracy : 1.0
##
## 'Positive' Class : Charged Off
##
# compute accuracy by comparing prediction with actual classification
mean(RFpredict_down == down_test$Loan_Status)
## [1] 1
set.seed(9650)
# The predictions are factors (stored as integers). Convert them to numeric (double) for the t-test comparison to work.
kNNpredict <- as.numeric(down_kNNpredict)
RFpredict <- as.numeric(RFpredict_down)
typeof(kNNpredict)
## [1] "double"
typeof(RFpredict_down)
## [1] "integer"
summary(kNNpredict)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.504 2.000 2.000
summary(RFpredict_down)
## Charged Off Fully Paid
## 1157 1157
# A paired t-test is applied because the same observations are predicted by two different algorithms.
# alternative='less' proposes that kNN is less accurate than Random Forest
# Testing the prediction
t.test(kNNpredict, RFpredict, alternative = 'less', paired = TRUE, conf.level = 0.95)
##
## Paired t-test
##
## data: kNNpredict and RFpredict
## t = 1.313, df = 2313, p-value = 0.9053
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.008763743
## sample estimates:
## mean of the differences
## 0.003889369
# alternative = 'two.sided', propose that two algorithms are different
t.test(kNNpredict, RFpredict, alternative = 'two.sided', paired = TRUE, conf.level = 0.95)
##
## Paired t-test
##
## data: kNNpredict and RFpredict
## t = 1.313, df = 2313, p-value = 0.1893
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.001919520 0.009698258
## sample estimates:
## mean of the differences
## 0.003889369
# Testing the model
#t.test(down_kNNfit, RFfit_down, alternative = 'less', paired = TRUE, conf.level = 0.95)
summary(down_kNNfit)
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 32 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
summary(RFfit_down)
## Length Class Mode
## call 6 -none- call
## type 1 -none- character
## predicted 6942 factor numeric
## err.rate 300 -none- numeric
## confusion 6 -none- numeric
## votes 13884 matrix numeric
## oob.times 6942 -none- numeric
## classes 2 -none- character
## importance 32 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 6942 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 32 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 2 -none- list
# The two fitted model objects cannot be compared directly because they have different parameters.
This section uses hypothesis testing to see whether the performance difference between k-NN and Random Forest is statistically significant or due to chance.
The p-value is 0.9 (>0.05) for “k-NN has lower performance than RF”. Therefore, the data do not provide enough evidence that k-NN performs worse than RF.
The p-value is 0.19 (>0.05) for “k-NN performs differently than RF”. Therefore, the data do not provide enough evidence that the two algorithms perform differently.
Taken together, the difference between the k-NN accuracy of 98% and the RF accuracy of 100% is due to chance and is not statistically significant.
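An alternative to the paired t-test for comparing two classifiers on the same observations is McNemar's test (the confusion matrices above already report a McNemar p-value): it considers only the discordant pairs, where one model is correct and the other is not. A sketch on hypothetical correctness vectors:

```r
n <- 200
# Hypothetical per-observation correctness of two paired classifiers
correct_A <- rep(TRUE, n); correct_A[1:12] <- FALSE  # model A misses 12 cases
correct_B <- rep(TRUE, n); correct_B[7:10] <- FALSE  # model B misses 4 of those cases

# McNemar's test works on the 2x2 table of paired (dis)agreement
tab <- table(A = correct_A, B = correct_B)
print(tab)
res <- mcnemar.test(tab)
print(res)
```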
Unsupervised algorithms are more challenging than supervised algorithms because there is no target feature to train the model on. That is why k-NN and Random Forest give 98%-100% accuracy while PCA, k-Means, and hierarchical clustering give much less accurate results on this bank loan status dataset.
Nonetheless, principal components PC1 and PC2 combined account for about 45% of the variability in the data (see the cumulative proportions above). Hierarchical clustering gives better internal validation and stability measures than k-Means and PAM. Among the hierarchical methods, the divisive approach gives a better result than the agglomerative approach in terms of the balance of the two cluster sizes, given that balanced data were used. The Silhouette score and the Elbow method both suggest 2 clusters, while the Gap statistic wrongly suggests 1.
Hypothesis testing shows that the difference between the k-NN accuracy of 98% and the RF accuracy of 100% is due to chance and is not statistically significant. This is interesting to know.
A weakness of this dataset is that the two classes do not separate well (the Hopkins statistic is 0.37). This is probably because the data are noisy or simply not ideal for unsupervised algorithms. Since I cannot change the inherent structure of the data, the PCA, k-Means, and hierarchical clustering analyses were carried out even though the data are not ideal.
In the future, it would be helpful to find or develop an unsupervised algorithm that improves on these results, because the prediction was excellent when a target variable was provided.